Indego is a bike share platform owned by the City of Philadelphia. The official visitor website for Philadelphia claims that it has become one of the most bike-friendly cities in the US (https://www.visitphilly.com/getting-around/#biking). With the introduction of a bike share platform, the city of Philadelphia has found a viable way to generate revenue while also providing its citizens with better connectivity and transportation services.
Customers of Indego have a choice between "classic" and "electric" bikes: typical non-electric bikes and rechargeable electric bikes, respectively. Throughout the city, Indego operates stations with varying numbers of "docks," which serve as parking spaces and charging points for electric bikes. A station's capacity to store electric bikes therefore equals its number of docks, while the number of classic bikes that can be parked has no upper limit.
Indego customers can choose from a variety of passes depending on their subscription level. Three pass types are currently in use (non-legacy): the Day Pass (also known as the Guest Pass), Indego30, and Indego365. They cost \$15.00, \$20.00, and \$156.00 (excluding tax), respectively, and grant access for 24 hours, 30 days, and 1 year, with an unlimited number of 60-minute rides across all stations and bike types. After the first 60 minutes of a ride, customers can extend their ride by paying an additional \$0.20 per minute. Electric bikes cost an extra \$0.20 per minute. (https://www.rideindego.com/buy-a-pass/#/)
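The pricing rules above can be sketched as a small function. The pass prices and per-minute rates come from the text; the function name and the assumption that the electric surcharge applies to every minute of the ride are ours:

```python
# Pre-tax pass prices as stated above
PASS_PRICE = {"Day Pass": 15.00, "Indego30": 20.00, "Indego365": 156.00}

def marginal_ride_cost(duration_min, bike_type):
    """Cost of a single ride beyond the pass price: the first 60 minutes
    are included, overage is $0.20/min, and electric bikes add a
    $0.20/min surcharge (assumed here to apply to the whole ride)."""
    overage = max(0, duration_min - 60) * 0.20
    surcharge = duration_min * 0.20 if bike_type == "electric" else 0.0
    return round(overage + surcharge, 2)
```

Under these assumptions, a 90-minute electric ride would cost an extra 30 × 0.20 + 90 × 0.20 = \$24.00 on top of the pass price.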
In this notebook, we set out to perform an analysis of these datasets to identify different usage trends and behaviors and, ultimately, determine how Indego may increase its profitability. To do this, we analyze the data, gather information about overall usage patterns, and make recommendations to increase platform efficiency and give riders a better experience.
Our analysis is broadly divided into two categories: (1) Analysis of Trip Data and (2) Analysis of Station Data. In both categories, we aim to find ways to improve operational efficiency, and we frame our questions of interest as such.
Which Indego plans (Indego30, Day Pass, Indego365) are most utilized, and which bike types (Standard, Electric) are most used and in demand? Is there a relationship between the two?
Why this matters: Our first analysis question asks whether there is any pattern or relationship between pass type and bike type, with the intention of assessing which bike type the company should invest more in and which plan types it should focus on promoting. Investing in any particular kind of bike carries considerable overhead costs: procuring electric bikes, installing docks with charging ports at stations, and marketing and promoting particular pass types. As a result, before we can offer any recommendations to the company, we must identify the "money makers" and evaluate their frequency and popularity, so that our advice about infrastructure investment and promotions is well grounded.
How does the traffic vary across different times of the day and across different seasons?
Why this matters: To increase the profitability of the company, we benefit from understanding the usage patterns. Analyzing the frequency of trips helps us understand the target demographic. Using these insights, the company can make better decisions regarding targeted marketing campaigns.
Conducting a health check of bikes: identifying bikes that require maintenance and investigating potential issues with the bikes.
Why this matters: To optimise user experience, Indego needs to ensure that its bikes perform smoothly on the road. Any problem encountered by a customer while riding could be detrimental to road safety, resulting in regulation violations and expensive lawsuits. As part of risk minimization, Indego would benefit from identifying potential issues with its bikes and from flagging bikes that have been used excessively over the past year as requiring maintenance.
What are the optimum rebalancing hours?
Why this matters: The most important part of optimising operational efficiency is determining when Indego should schedule its rebalancing operations across stations. We would like to ensure that bike availability is consistently maintained, so that a bike is available whenever a user intends to start a trip from their present station. Imbalances in bike availability and bike shortages are of particular concern during peak traffic hours. At the same time, high traffic can make the rebalancing operation itself sluggish and cost the company more in personnel and equipment. We would therefore like to find windows that balance low traffic against the peak crunch hours, and provide estimates for the rebalancing hours.
Analyzing station activity and inactivity across time. What criteria are employed by Indego to decide which stations to deactivate and activate? Can we use this information to suggest which stations will have to be deactivated in the future?
Why this matters: Every infrequently used station adds to the company's overhead, so Indego would benefit from deactivating some of its stations from time to time. Before we can make this recommendation, however, we need to establish patterns from past station status and traffic. Indego activates and deactivates some of its stations at different points in time. Using the station activity dataset along with the trips dataset for the same quarter across different years, or across consecutive quarters, we analyze the demand for and popularity of bikes at these stations. We use this analysis to recommend changes to station capacity to improve operational efficiency across different stations.
Indego has provided the general public with quarterly anonymized trip data since its launch in April 2015. The data includes rider details including pass type as well as metrics for each trip (start and finish points, duration, time of day, and bike type used). Additionally, Indego has made information about its stations' locations, capacity, statuses (active or dormant at certain times), and pricing options available to the public.
Indego shares this data on their official website: https://www.rideindego.com/about/data/ This site has been our source for all the data used in this project. The quarterly data used in our analysis is from the years 2021 and 2022: 2022 Q2 (April – June), 2022 Q1 (January – March), 2021 Q4 (October – December), 2021 Q3 (July – September), 2021 Q2 (April – June), and 2021 Q1 (January – March).
Indego also provides station data, which includes the station ID, the station name, the go-live date (i.e., the date the station became active), and its current status. According to their website, this data was last updated on July 1, 2022.
Each station has a fixed number of docks available for electric and standard bikes. The number of bikes at each station and their types is made available via a GeoJSON file.
All the datasets we used are available here: https://drive.google.com/drive/folders/1LJlhc8F3_fe3qE7MLaHHJEfvxsNEP4Jo?usp=sharing
Since the data we acquired has been published by Indego with some pre-processing already done, our project is analysis-heavy. We also use inferences, obtained primarily from the analyses we conduct on the acquired datasets, to make recommendations to the company. Even though our work is analysis-heavy, considerable cleaning and processing is still required to suit our analyses.
A description and presentation of the tasks required to process (i.e., clean, transform, enhance, etc.) the data and prepare it for analysis.
Data Wrangling
[1] Subsetting
[2] Transformation
[3] Indexing and filtering
[4] Dataset merging
[5] Creating new variables
The information about the docking capacity at each station, the number of bikes available, and their type needs to be extracted from a GeoJSON file. The data from this file has been extracted and converted into a csv file to aid our analysis.
# import libraries
import pandas as pd
import numpy as np
import re
import datetime as dt
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import urllib
from urllib.request import urlopen
import json
from lxml import etree
from scipy import stats
#importing station data
url = "https://gbfs.bcycle.com/bcycle_indego/station_status.json"
print(url)
response = urlopen(url)
data = json.loads(response.read())
#dropping the columns 'is_returning', 'is_renting', 'is_installed', 'last_reported'
norm_df = pd.json_normalize(data["data"], record_path='stations').drop(['is_returning', 'is_renting', 'is_installed', 'last_reported'], axis=1)
norm_df['station_id'] = norm_df['station_id'].map(lambda x: re.sub('bcycle_indego_','', x))
norm_df=norm_df.drop_duplicates()
#Use the set_index method to make station_id the index
norm_df = norm_df.set_index('station_id')
norm_df
https://gbfs.bcycle.com/bcycle_indego/station_status.json
| num_docks_available | num_bikes_available | num_bikes_available_types.electric | num_bikes_available_types.smart | num_bikes_available_types.classic | |
|---|---|---|---|---|---|
| station_id | |||||
| 3004 | 16 | 14 | 7 | 0 | 7 |
| 3005 | 10 | 3 | 1 | 0 | 2 |
| 3006 | 13 | 4 | 3 | 0 | 1 |
| 3007 | 12 | 5 | 3 | 0 | 2 |
| 3008 | 5 | 11 | 8 | 0 | 3 |
| ... | ... | ... | ... | ... | ... |
| 3328 | 17 | 4 | 1 | 0 | 3 |
| 3329 | 1 | 16 | 3 | 0 | 13 |
| 3334 | 4 | 15 | 9 | 0 | 6 |
| 3336 | 18 | 2 | 2 | 0 | 0 |
| 3337 | 12 | 9 | 3 | 0 | 6 |
216 rows × 5 columns
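From a snapshot like the one above, stations at risk of running empty or completely full can be flagged. The thresholds below are illustrative choices of ours, not Indego's operational criteria:

```python
import pandas as pd

def flag_stations(snapshot, low_bikes=2, low_docks=2):
    """Mark stations that are nearly empty ('starved') or nearly full
    ('saturated') in a GBFS-style snapshot with the columns shown above."""
    out = snapshot.copy()
    out['starved'] = out['num_bikes_available'] <= low_bikes
    out['saturated'] = out['num_docks_available'] <= low_docks
    return out
```

For example, station 3336 above (18 free docks, 2 bikes) would be flagged as starved, while 3329 (1 free dock, 16 bikes) would be flagged as saturated.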
Data cleaning is the step that comes after data ingestion (reading the data into an appropriate format). It consists of identifying, understanding, and potentially fixing problems with the data.
Validate: Does the data make sense? Can a story be told?
Data loaded in from the JSON is normalized into a dataframe.
The datasets for the four most recent quarters found on Indego's website (https://www.rideindego.com/about/data/) are loaded in. These datasets are the most recent and up-to-date. Furthermore, they account for all months and all seasons which are potential independent factors for Indego's consumers.
For cleaning purposes, the "plan_duration" column is dropped as it is deemed unnecessary and repetitive of the column that indicates the plan type. After being concatenated, they summarize a year of Indego data.
# Import data and drop the columns 'plan_duration'
indego_2022q1= pd.read_csv('indego-trips-2022-q1.csv', encoding= 'unicode_escape').drop(['plan_duration'], axis=1)
indego_2022q2 = pd.read_csv('indego-trips-2022-q2.csv', encoding= 'unicode_escape').drop(['plan_duration'], axis=1)
indego_2021q3 = pd.read_csv('indego-trips-2021-q3.csv', encoding= 'unicode_escape').drop(['plan_duration'], axis=1)
indego_2021q4 = pd.read_csv('indego-trips-2021-q4.csv', encoding= 'unicode_escape').drop(['plan_duration'], axis=1)
indego_2021q2 = pd.read_csv('indego-trips-2021-q2.csv')
/var/folders/2n/t3y5_nhj57j7nhbtsrjgsvnc0000gn/T/ipykernel_3609/869977200.py:4: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
indego_2021q3 = pd.read_csv('indego-trips-2021-q3.csv', encoding= 'unicode_escape').drop(['plan_duration'], axis=1)
/var/folders/2n/t3y5_nhj57j7nhbtsrjgsvnc0000gn/T/ipykernel_3609/869977200.py:5: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
indego_2021q4 = pd.read_csv('indego-trips-2021-q4.csv', encoding= 'unicode_escape').drop(['plan_duration'], axis=1)
/var/folders/2n/t3y5_nhj57j7nhbtsrjgsvnc0000gn/T/ipykernel_3609/869977200.py:6: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
indego_2021q2 = pd.read_csv('indego-trips-2021-q2.csv')
#Get a dataframe of a whole year
annual = pd.concat([indego_2022q1, indego_2022q2, indego_2021q3, indego_2021q4])
#Rename some columns with proper name
annual.rename(columns= {'start_station':'start_station_id','end_station':'end_station_id'},inplace= True)
#Split start time into date and time
annual[['start_date','start_time']] = annual['start_time'].str.split(' ', expand=True)
#Convert from string to datetime
annual['start_time'] = annual['start_time'].map(lambda t:dt.datetime.strptime(t,'%H:%M'))
annual['start_date'] = annual['start_date'].map(lambda t:dt.datetime.strptime(t,'%m/%d/%Y'))
annual['start_time'] = annual['start_time'].map(lambda t:t.time())
#Repeat same steps on end time
annual[['end_date','end_time']] = annual['end_time'].str.split(' ', expand=True)
annual['end_time'] = annual['end_time'].map(lambda t:dt.datetime.strptime(t,'%H:%M'))
annual['end_date'] = annual['end_date'].map(lambda t:dt.datetime.strptime(t,'%m/%d/%Y'))
annual['end_time'] = annual['end_time'].map(lambda t:t.time())
#For analysis of stations abandoned in 2021 quarter 3, we also need data in 2021 q2
indego_2021q2.rename(columns= {'start_station':'start_station_id','end_station':'end_station_id'},inplace= True)
indego_2021q2 = indego_2021q2[indego_2021q2['duration'] < 1400]
#Split start time into date and time
indego_2021q2[['start_date','start_time']] = indego_2021q2['start_time'].str.split(' ', expand=True)
#Convert from string to datetime
indego_2021q2['start_time'] = indego_2021q2['start_time'].map(lambda t:dt.datetime.strptime(t,'%H:%M'))
indego_2021q2['start_date'] = indego_2021q2['start_date'].map(lambda t:dt.datetime.strptime(t,'%m/%d/%Y'))
indego_2021q2['start_time'] = indego_2021q2['start_time'].map(lambda t:t.time())
indego_2021q2['start_date'] = indego_2021q2['start_date'].map(lambda t:t.date())
#Repeat same steps on end time
indego_2021q2[['end_date','end_time']] = indego_2021q2['end_time'].str.split(' ', expand=True)
indego_2021q2['end_time'] = indego_2021q2['end_time'].map(lambda t:dt.datetime.strptime(t,'%H:%M'))
indego_2021q2['end_date'] = indego_2021q2['end_date'].map(lambda t:dt.datetime.strptime(t,'%m/%d/%Y'))
indego_2021q2['end_time'] = indego_2021q2['end_time'].map(lambda t:t.time())
#Processing datetime columns
indego_2021q3.rename(columns= {'start_station':'start_station_id','end_station':'end_station_id'},inplace= True)
indego_2021q3 = indego_2021q3[indego_2021q3['duration'] < 1400]
indego_2021q3[['start_date','start_time']] = indego_2021q3['start_time'].str.split(' ', expand=True)
indego_2021q3['start_time'] = indego_2021q3['start_time'].map(lambda t:dt.datetime.strptime(t,'%H:%M'))
indego_2021q3['start_date'] = indego_2021q3['start_date'].map(lambda t:dt.datetime.strptime(t,'%m/%d/%Y'))
indego_2021q3['start_time'] = indego_2021q3['start_time'].map(lambda t:t.time())
indego_2021q3['start_date'] = indego_2021q3['start_date'].map(lambda t:t.date())
indego_2021q3[['end_date','end_time']] = indego_2021q3['end_time'].str.split(' ', expand=True)
indego_2021q3['end_time'] = indego_2021q3['end_time'].map(lambda t:dt.datetime.strptime(t,'%H:%M'))
indego_2021q3['end_date'] = indego_2021q3['end_date'].map(lambda t:dt.datetime.strptime(t,'%m/%d/%Y'))
indego_2021q3['end_time'] = indego_2021q3['end_time'].map(lambda t:t.time())
#Checking if the dataset contains duplicate values and eliminate them if any
annual = annual.drop_duplicates(keep='last')
For documentation reference, each quarterly dataset contains, among other fields, the columns used above: trip_id, duration, start_time, end_time, start_station, end_station, bike_id, plan_duration, passholder_type, and bike_type.
In summary, cleaning the annual and quarterly datasets involved dropping the redundant plan_duration column, renaming the station columns, splitting the start and end timestamps into separate date and time columns, filtering out implausibly long trips in the quarterly frames, and removing duplicate rows.
The stations dataset is also ingested from Indego's website (https://www.rideindego.com/about/data/). These datasets contain information about the activation status for each station.
For the scope of our analysis, we merge the three datasets containing station information covering 2021 Q3 (July – September) to 2022 Q2 (April – June), because there have been various status changes that we need to account for over time. Our analysis questions require the station_id and status columns; the other columns created during the merges are dropped, and the columns that are kept are renamed to make them easier to work with.
After merging the three station datasets, we created a function that checks for differences in each station's status at different points in time (across the period 2021 Q3 to 2022 Q2).
#Import the dataset
station_2021_7 = pd.read_csv('indego-stations-2021-07-01.csv')
station_2021_10 = pd.read_csv('indego-stations-2021-10-01.csv')
station_2022_7 = pd.read_csv('indego-stations-2022-07-01.csv')
#Merge the station data to identify the changes on their status
merge1 = pd.merge(station_2021_7,station_2021_10,how='outer',on='Station_ID')
station = pd.merge(merge1, station_2022_7, how='outer', on='Station_ID')
#Drop duplicated columns to make the table easier to read
station.drop(axis=1,columns=['Station_Name_x', 'Day of Go_live_date_x','Station_Name_y', 'Day of Go_live_date_y'],inplace = True)
#Rename the status columns to match their snapshot dates
#(after the merges, Status_x is 2021-07, Status_y is 2021-10, and the unsuffixed Status is 2022-07)
station = station.rename(columns={'Status_x':'Status_21_7','Status_y':'Status_21_10','Status':'Status_22_7'})
#Define a function to identify the stations with status changed
def change(a, b):
    if pd.isna(a) or pd.isna(b):
        return np.nan
    return a != b
#Add a new column, which records if the status changed between 2021-07 and 2021-10
station['change_21_7_10']= station.apply(lambda x:change(x['Status_21_7'],x['Status_21_10']),axis = 1)
#Add a new column, which records if the status changed between 2021-10 and 2022-07
station['change_21_10_22_7']= station.apply(lambda x:change(x['Status_21_10'],x['Status_22_7']),axis = 1)
As the first part of improving the operational efficiency of Indego, we identify the most utilized Indego plan among Indego30, DayPass and Indego365. We also compare which bike type is more frequently used. We then aim to find if there exists any relationship between the passholder type and the bike type, and find the most popular combination of passholder type and bike type.
#Identify the variable type of 'bike_type' and 'passholder_type'
print(annual['bike_type'].unique())
print(annual['passholder_type'].unique())
['standard' 'electric'] ['Indego365' 'Indego30' 'Day Pass' 'Walk-up']
We first find out the values of the 'bike_type' and 'passholder_type' variables present in the annual dataset we have accumulated and prepared, accounting for a year. Then, we prepare a contingency table to analyze the frequency of occurrences of these values across the trips that have happened for this year.
#Get a contingency table
contigency_table = pd.crosstab(annual['bike_type'],annual['passholder_type'],margins=True )
contigency_table
| passholder_type | Day Pass | Indego30 | Indego365 | Walk-up | All |
|---|---|---|---|---|---|
| bike_type | |||||
| electric | 34254 | 250223 | 58979 | 3 | 343459 |
| standard | 48496 | 401930 | 121141 | 1 | 571568 |
| All | 82750 | 652153 | 180120 | 4 | 915027 |
The contingency table presents interesting insights. Indego30 accounts for roughly 71% of all trips (652,153 of 915,027), dwarfing both Indego365 (180,120) and the Day Pass (82,750), and standard bikes are used more than electric bikes under every pass type (571,568 vs. 343,459 trips overall). Walk-up trips are negligible (4 in the whole year).
From these observations we can make a rough inference about user behavior: Indego customers overwhelmingly prefer to renew their subscription month by month rather than commit to a year-long subscription.
From the contingency table, we found the most popular combination to be Indego30 and Standard bike. We would like to find out if the passholder types and bike types influence each other’s popularity. Since 'bike_type' and 'passholder_type' are categorical variables, we accomplish this by doing a correlation analysis between these two variables with the help of Chi-square test.
#Get the frequency values
electric_count = contigency_table.iloc[0,0:3].values
standard_count = contigency_table.iloc[1,0:3].values
#Calculate chi-square value
freq = np.array([contigency_table.iloc[0,0:3].values,
contigency_table.iloc[1,0:3].values])
print(stats.chi2_contingency(freq)[0:3])
(2482.2419117216205, 0.0, 2)
From the chi-square test, we obtain a p-value reported as 0.0, i.e., effectively zero at floating-point precision. At any reasonable significance level, we reject the null hypothesis of independence: passholder type and bike type are associated.
According to the contingency table, Day Pass holders show the highest preference for electric bikes: roughly 41% of their trips use electric bikes, versus about 38% for Indego30 and 33% for Indego365 (ignoring the four Walk-up trips).
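With over 900,000 trips, even a tiny deviation from independence drives the p-value to zero, so it is worth checking the strength of the association as well. Cramér's V is one such effect-size measure (this is our addition, not part of the chi-square output above):

```python
import numpy as np
from scipy import stats

# Frequencies from the contingency table above
# (Walk-up excluded, as in the chi-square test)
freq = np.array([[34254, 250223, 58979],    # electric
                 [48496, 401930, 121141]])  # standard
chi2 = stats.chi2_contingency(freq)[0]
n = freq.sum()
# Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
cramers_v = np.sqrt(chi2 / (n * (min(freq.shape) - 1)))
print(round(cramers_v, 3))
```

The association is statistically significant but weak (V ≈ 0.05), which tempers how much weight the pass-type/bike-type relationship should carry in planning decisions.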
Finding out how the traffic and bike usage varies across different times and different months of the year can give us insights regarding the user demographics. We consider the data across the “annual” dataset and filter trips for the work days (Monday to Friday). We plot the frequency of the trips across different time intervals for a typical work day.
#Frequency across week
annual['dow'] = pd.to_datetime(annual["start_date"]).dt.day_name()
weekdays = ['Monday', "Tuesday", "Wednesday", "Thursday", "Friday"]
mon_to_friday = annual[annual['dow'].isin(weekdays)]
workdays_df = mon_to_friday[['dow', 'start_time']].groupby('start_time').count()
fig = px.line(workdays_df, title='Indego Typical Workday Usage (2021 Q3 - 2022 Q2)')
fig.show()
From the line chart above, trip frequency peaks around typical commuting hours, with pronounced surges in the morning and late afternoon.
From this we can reasonably infer that commuting to and from work is the most popular reason Indego customers use the bike share service, since the peak traffic hours coincide with typical work start and end times.
We also plot the frequency of trips across a whole year.
#Create a new column with the month extracted from start_date
annual['Month'] = annual['start_date'].map(lambda date : date.month)
group =annual.groupby('Month')
act_2021=group.count()
Month=['January','February','March','April','May','June','July','August','September','October','November','December']
act_2021 = act_2021.assign(Months=Month)
fig = px.line(act_2021, x="Months", y="trip_id", title='Distribution of Trips')
fig.update_xaxes(tickangle=300)
fig.show()
From the plot obtained above, we find a consistently increasing trend in trip frequency from spring to fall, culminating in September, when more than 10% of the year's trips occurred. We also find a steep fall in trip frequency from fall to winter, with only about 30,000 trips in January. This falls in line with our initial hypothesis that traffic is most likely to drop in the harsh winter months.
As noted earlier, to optimise user experience Indego needs its bikes to perform smoothly on the road: any problem a customer encounters mid-ride is detrimental to road safety and can result in regulation violations and expensive lawsuits. As part of risk minimization, Indego would benefit from identifying potential issues with its bikes and from flagging excessively used bikes for maintenance.
We conduct two checks as part of this analysis: (1) identifying overused bikes that should be flagged for maintenance, and (2) comparing the usage statistics of electric and standard bikes to check for discrepancies.
To identify overused bikes, we use statistical analysis. Counting the occurrences of each unique bike_id gives the number of rides per bike, from which we obtain descriptive statistics. We then apply the boxplot upper-whisker rule: any bike whose ride count exceeds Q3 + 1.5 × IQR is treated as an outlier, i.e., a bike used unusually more than expected, which is important to track for maintenance purposes.
#Descriptive statistics for most used bikes
top_bikes = annual["bike_id"].value_counts()
top_bikes.describe()
count 3340.000000 mean 273.960180 std 134.776615 min 1.000000 25% 171.750000 50% 270.000000 75% 373.000000 max 772.000000 Name: bike_id, dtype: float64
#Finding inter quartile range
q3 = np.quantile(annual["bike_id"].value_counts(), 0.75)
q1 = np.quantile(annual["bike_id"].value_counts(), 0.25)
iqr = q3-q1
# find upper whisker
upper_bound = q3+(1.5*iqr)
print(upper_bound)
674.875
The data frame of the unique bike_ids considered outliers and the count of trips completed on each is merged with a bike-type lookup to determine which bike type was most overused and would therefore need more attention for maintenance purposes. The bikes that exceeded the upper bound are the "outliers," i.e., bikes used unusually more than the rest of the fleet. As the outputs below show, three bikes were overused, all of them electric.
# bikes above 675 rides are considered outliers and would need more maintenence
outliers = top_bikes[top_bikes>upper_bound]
outliers_df = outliers.to_frame()
outliers_df = outliers_df.reset_index()
outliers_df = outliers_df.rename(columns={"index": "bike_id","bike_id":"ride_count" })
unique_bikes = annual[["bike_id", "bike_type"]].drop_duplicates()
unique_bikes = unique_bikes.merge(outliers_df, how='inner', on='bike_id')
unique_bikes.shape
(3, 3)
#Bikes identified to schedule maintenance operations
unique_bikes["bike_type"].value_counts()
electric 3 Name: bike_type, dtype: int64
We predicted that electric bikes would require more maintenance due to being used more often. The value counts confirm that the overused "outlier" bikes are all electric; these would need extra attention for maintenance, such as battery replacement, for best performance.
To check for any discrepancies in the electric bikes, we compare their usage statistics with standard bikes. We approach this analysis with the hypothesis that if there is any stark difference between the number of trips for electric bikes across short trips or long trips compared to standard bikes, then this is a cause for concern and raises an issue regarding proper functioning and running of the electric bikes. We hypothesize that regardless of the nature of the trip, the number of trips initiated with electric bikes should be comparable to the standard bikes.
#Finding the duration values for different percentiles
electric_bike_rides = annual[annual['bike_type'] == 'electric']
standard_bike_rides = annual[annual['bike_type'] == 'standard']
number_of_electric_trips = np.sum(annual['bike_type'] == 'electric')
number_of_standard_trips = np.sum(annual['bike_type'] == 'standard')
average_trip_duration = annual['duration'].mean()
for i in range(1, 11):
    percentile = annual['duration'].quantile(i/10)
    print(f' the {i*10}th percentile is {percentile}')
lower_quartile = annual['duration'].quantile(0.25)
print(f'the value of the 25th percentile is {lower_quartile}')
upper_quartile = annual['duration'].quantile(0.75)
print(f'the value of the 75th percentile is {upper_quartile}')
the 10th percentile is 4.0 the 20th percentile is 6.0 the 30th percentile is 7.0 the 40th percentile is 9.0 the 50th percentile is 10.0 the 60th percentile is 13.0 the 70th percentile is 16.0 the 80th percentile is 21.0 the 90th percentile is 33.0 the 100th percentile is 1440.0 the value of the 25th percentile is 6.0 the value of the 75th percentile is 18.0
We decide on an upper and lower cut off to define what constitutes a short trip or a long trip and visualise the frequency of trips across bike types for these categories.
#Compare mean trip durations across bike types
print(electric_bike_rides['duration'].mean())
print(standard_bike_rides['duration'].mean())
18.17615494134671 17.842059037594826
#Finding the frequency of trips across bike types for long duration type trips
long_electric = np.sum(electric_bike_rides['duration'] >= 1200)
print(long_electric)
long_standard = np.sum(standard_bike_rides['duration'] >= 1200)
print(long_standard)
266 267
# we use the lower quartile of the durations for a trip as a threshold
short_electric = np.sum(electric_bike_rides['duration'] <= 6)
print(short_electric)
short_standard = np.sum(standard_bike_rides['duration'] <= 6)
print(short_standard)
94337 136886
bike_type = ['electric', 'standard']
short_rides_count = [short_electric, short_standard]
long_rides_count = [long_electric, long_standard]
x1 = bike_type
y1 = short_rides_count
x2 = bike_type
y2 = long_rides_count
# plt.rcParams["figure.figsize"] = (10, 5)
plt.figure()
# adding first subplot
plt.subplot(121)
plt.bar(x1, y1, width = 0.6, color='#f4c2c2')
plt.grid(axis='y', alpha=0.75)
plt.title('number of short trips (6 minutes or less)')
plt.xlabel('type of indego bike')
plt.ylabel('count')
# adding second subplot
plt.subplot(122)
plt.bar(x2, y2, width = 0.6, color='#30D5C8')
plt.grid(axis='y', alpha=0.75)
plt.title('number of long trips (1200 minutes or more)')
plt.xlabel('type of bike')
plt.ylabel('count')
Text(0, 0.5, 'count')
From the plots obtained above, we find that for short-duration trips, more standard bikes are used than electric, but the frequencies are comparable and there is no stark difference.
For the longer-duration trips, the trip frequencies for electric and standard bikes are almost the same. This falls in line with our initial hypothesis and reassures us that there are no glaring signs of a usage drop for electric bikes.
Finding the optimum rebalancing hours is the most important and most pressing aspect of operations for a bike share company. It can also be a complicated question to answer, depending on the factors considered. Since Indego has not disclosed any data on the number of trucks it operates, the number of bikes that can be accommodated in a single truck, or the cost of rebalancing operations, we cannot use a statistical analysis to determine the exact rebalancing hours. However, from the arrival and departure rates at the most popular stations, we can get a rough idea of the ideal rebalancing hours.
There are two things to consider when rebalancing bikes:
Mitigating bike unavailability: We would like to ensure that bike availability is consistently maintained across stations whenever a user intends to start a trip from their present station. Imbalance in bike availability and bike shortage is of particular concern during the peak traffic hours.
Avoiding high traffic hours: Rebalancing during high traffic can make the rebalancing operation inefficient and slow, and cost the company more in terms of personnel and equipment.
Taking both of the above factors into consideration, we need to find windows that balance low traffic against the peak crunch hours.
To find the optimal rebalancing hours, we analyze trip frequency for the 50 most active stations, where station activity is defined as the sum of arrival and departure instances. We then compute the difference between the aggregated departure and arrival counts for these stations across hours of the day to better visualize the imbalance.
#Add two columns about which hour the bike comes into the station or out of the station
annual['out_hod'] = annual['start_time'].map(lambda t:t.hour)
annual['in_hod'] = annual['end_time'].map(lambda t:t.hour)
#Get top 50 popular stations
out = annual.groupby(by=annual['start_station_id']).count()['out_hod']#Count how many bikes going out of each station
in_st = annual.groupby(by=annual['end_station_id']).count()['in_hod']#Count how many bikes coming into each station
#Sum arrivals and departures per station, then keep the 50 most active
top50 = (out + in_st).sort_values(ascending=False)[:50]
#Convert it into a list
list_top_50 = top50.index.to_list()
#Crosstab of hourly departures from each top-50 station
annual_top50_start = annual[annual['start_station_id'].isin(list_top_50)]
out_num = pd.crosstab(index=annual_top50_start['start_station_id'],columns=annual_top50_start['out_hod'],margins=True)
#Drop the 'All' column, which sums up the usage of each station
out_num = out_num.drop(columns=['All'])
#Crosstab of hourly arrivals into each top-50 station
annual_top50_end = annual[annual['end_station_id'].isin(list_top_50)]
in_num = pd.crosstab(index=annual_top50_end['end_station_id'],columns=annual_top50_end['in_hod'],margins=True)
#Drop the 'All' column here as well
in_num = in_num.drop(columns=['All'])
diff = out_num - in_num
diff
| out_hod | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| start_station_id | |||||||||||||||||||||
| 3000 | -129 | -105 | -58 | -32 | -45 | -143 | -372 | -580 | -940 | -721 | ... | -833 | -932 | -1064 | -1263 | -1065 | -895 | -541 | -383 | -248 | -197 |
| 3004 | 20 | 17 | 34 | 2 | -2 | 68 | 14 | -61 | -53 | -62 | ... | -89 | -43 | 100 | 66 | 49 | 32 | 49 | 54 | 11 | 6 |
| 3005 | 100 | 50 | 41 | -1 | 4 | 5 | -31 | -36 | -25 | -43 | ... | 51 | -44 | 0 | 19 | -76 | -100 | -48 | 99 | 82 | 65 |
| 3006 | 12 | 38 | 13 | 4 | 13 | 20 | 80 | -84 | 43 | 60 | ... | 14 | 63 | 46 | 107 | 67 | -37 | -59 | -37 | -46 | 7 |
| 3007 | 68 | 121 | 181 | 38 | -16 | 106 | -27 | 82 | 342 | 114 | ... | -73 | -211 | -105 | -48 | -135 | -43 | -92 | -59 | 34 | 120 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3275 | -1 | 1 | -4 | -6 | 0 | -1 | 9 | 25 | 45 | -5 | ... | 1 | -3 | -26 | -34 | -16 | -22 | 1 | -10 | -16 | 1 |
| 3276 | -4 | -10 | -3 | -2 | 1 | 4 | 0 | 18 | 6 | 8 | ... | 1 | 6 | 14 | -21 | 9 | -6 | 4 | -12 | -4 | -10 |
| 3277 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 1 | -1 | -1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3278 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 | 0 | 0 |
| All | -1434 | -461 | -647 | -709 | -407 | 1025 | 1877 | 3737 | 2150 | -1021 | ... | 1341 | 960 | 2898 | 595 | -1880 | -4292 | -4218 | -2550 | -1552 | -1737 |
185 rows × 24 columns
#Daily average imbalance (departures - arrivals) across the top 50 stations
plt.figure(figsize=(15,10))
plt.title('Station Activity across time')
plt.xlabel('Hour')
plt.ylabel('Daily Avg usage')
width = 0.4
#The 'All' margin row aggregates all top-50 stations; divide by 365 days x 50 stations
plt.bar(diff.columns,diff.loc['All']/(365*50),width=width,label='out-in')
plt.legend()
<matplotlib.legend.Legend at 0x7febfc0d5ac0>
We find that for the 50 most active stations, the difference between departure and arrival rates peaks between 12 pm and 3 pm and dips lowest between 6 pm and 9 pm. In other words, these stations see a net outflow of bikes (more departures than arrivals) in the early afternoon and a net inflow in the evening. Both are instances of bike imbalance that rebalancing operations need to offset.
From our earlier analysis of peak traffic hours, we found bike traffic rising steadily from 5:30 am to a peak at 9 am, dipping steeply shortly after, and then increasing again from 11 am until 5 pm.
Taking these two observations together, a rough estimate of the optimal schedule is one rebalancing operation at 12 pm and another at 9 pm.
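The trade-off described above, targeting hours where the imbalance is large but traffic is light, can be sketched as a simple scoring heuristic. This is our own illustration on invented hourly figures, not Indego's method or real data:

```python
import numpy as np

# Illustrative hourly profiles (invented numbers, not real Indego data):
# net_flow[h] = departures minus arrivals at hour h across the top stations,
# traffic[h]  = total trips touching those stations at hour h.
net_flow = np.array([0, 0, 0, 0, 0, 1, 3, 6, 8, 6, 5, 8,
                     14, 13, 12, 9, 4, -2, -9, -14, -12, -8, -4, -1], dtype=float)
traffic = np.array([5, 3, 2, 2, 3, 10, 25, 60, 80, 55, 40, 50,
                    60, 62, 65, 70, 78, 82, 60, 45, 30, 20, 12, 8], dtype=float)

# Score each hour: a large absolute imbalance is worth fixing, while heavy
# traffic makes the truck run slow and costly, so it is penalized.
score = np.abs(net_flow) / (1.0 + traffic)

# Pick the two best-scoring hours that are at least 6 hours apart,
# so the two truck runs do not land in the same crunch.
order = np.argsort(score)[::-1]
first = int(order[0])
second = int(next(h for h in order[1:] if abs(int(h) - first) >= 6))
print(sorted([first, second]))  # -> [12, 20], i.e. around noon and 8 pm
```

With real data, `net_flow` would come from a series such as `diff.loc['All']` and `traffic` from the hourly trip counts; the exact penalty and minimum gap are tunable knobs, not fixed rules.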
Every station that sits infrequently used adds to the company's overhead, so Indego would benefit from deactivating some of its stations from time to time. Before making this recommendation, however, we need to establish patterns from past station status and traffic. Indego activates and deactivates some of its stations at different points in time. Using the station activity dataset together with the trips dataset, for the same quarter across different years or across consecutive quarters, we analyze the demand for bikes at these stations. We use this analysis to recommend changes to station capacity that would improve operational efficiency across stations.
To begin, we compute the trip frequency for each station; this makes it easier to plot the line charts later. We focus on the stations whose status changed between the snapshots.
#Get a list of status-changed stations
status_changed_usage = station[(station['change_22_10_7']==True)|(station['change_21_7_10']==True)]['Station_ID'].tolist()
status_changed_usage
[3038, 3192, 3213, 3214]
#To see when the status changed
station[station['Station_ID'].isin(status_changed_usage)]
| Station_ID | Status_21_10 | Status_22_7 | Station_Name | Day of Go_live_date | Status_21_7 | change_21_7_10 | change_22_10_7 | |
|---|---|---|---|---|---|---|---|---|
| 35 | 3038 | Inactive | Inactive | The Children's Hospital of Philadelphia (CHOP) | 4/23/2015 | Active | True | False |
| 137 | 3192 | Active | Active | 2nd & Fairmount | 8/14/2018 | Inactive | True | False |
| 154 | 3213 | Active | Inactive | Broad & Carpenter | 9/16/2020 | Inactive | True | True |
| 155 | 3214 | Active | Active | Broad & Cecil B Moore | 9/22/2020 | Inactive | True | False |
#This code raises an error: 3213 is not active after 2021-10.
pd.crosstab(columns=indego_2021q4['start_station_id'],index=indego_2021q4['start_date'])[3213]
pd.crosstab(columns=indego_2022q1['start_station_id'],index=indego_2022q1['start_date'])[3213]
pd.crosstab(columns=indego_2022q2['start_station_id'],index=indego_2022q2['start_date'])[3213]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [35], in <cell line: 2>()
----> 2 pd.crosstab(columns=indego_2021q4['start_station_id'],index=indego_2021q4['start_date'])[3213]
...
KeyError: 'start_station_id'
Since station 3213 is shown as active in Q4 2021 and inactive in Q2 2022, there should be a record of it in the 2021 trips dataset.
To get the date on which the status changed, we execute the following code.
#Get the usage trend of 3213
q3_usage_3213=pd.crosstab(columns=indego_2021q3['start_station_id'],index=indego_2021q3['start_date'])[3213]
#Find the first day on which 3213 recorded zero trips
q3_usage_3213.eq(0).drop_duplicates(keep='first')
start_date
2021-07-01    False
2021-07-15     True
Name: 3213, dtype: bool
#Get the usage trend of 3038
q3_usage_3038= pd.crosstab(columns=indego_2021q3['start_station_id'],index=indego_2021q3['start_date'])[3038]
#Find the first day on which 3038 recorded zero trips
q3_usage_3038.eq(0).drop_duplicates(keep='first')
start_date
2021-07-01    False
2021-07-08     True
Name: 3038, dtype: bool
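The trick used above generalizes: comparing the daily counts to zero yields a Boolean series, and `drop_duplicates(keep='first')` keeps only the first occurrence of each value, so the first `True` row marks the first zero-usage day. A minimal sketch with made-up daily counts for a hypothetical station:

```python
import pandas as pd

# Toy daily trip counts (invented): steady usage, then zero after deactivation.
idx = pd.date_range("2021-07-01", periods=10, freq="D")
counts = pd.Series([12, 9, 14, 11, 8, 0, 0, 0, 0, 0], index=idx, name="trips")

# Boolean series: True on zero-trip days. drop_duplicates keeps only the
# first False and the first True, so the True row dates the shutdown.
changes = counts.eq(0).drop_duplicates(keep="first")
print(changes)

first_zero_day = changes[changes].index[0]
print(first_zero_day.date())  # -> 2021-07-06
```

Note this finds the first zero-usage day anywhere in the window, so it only identifies a deactivation date when usage was nonzero beforehand, as it is for 3213 and 3038.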
Since the stations became inactive in Q3 2021, to analyze their usage trend, we will need to import the trip data from Q2 2021. Again, we create a crosstab to get the frequency.
#Generating the frequency crosstab
q2q3 = pd.concat([indego_2021q2,indego_2021q3], ignore_index=True, sort=False)
freq_crosstab=pd.crosstab(columns=q2q3['start_station_id'],index=q2q3['start_date'],margins=True).sort_values(by='All',axis=1)
freq_crosstab
| start_station_id | 3250 | 3000 | 3254 | 3240 | 3256 | 3243 | 3251 | 3252 | 3253 | 3247 | ... | 3167 | 3208 | 3007 | 3032 | 3190 | 3057 | 3202 | 3010 | 3212 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| start_date | |||||||||||||||||||||
| 2021-04-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 21 | 23 | 22 | 29 | 20 | 10 | 40 | 31 | 6 | 1293 |
| 2021-04-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 21 | 31 | 25 | 36 | 36 | 17 | 49 | 26 | 8 | 1398 |
| 2021-04-03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 27 | 17 | 24 | 22 | 28 | 84 | 36 | 51 | 85 | 1940 |
| 2021-04-04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 31 | 27 | 26 | 22 | 35 | 74 | 37 | 36 | 128 | 2050 |
| 2021-04-05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 30 | 36 | 27 | 28 | 24 | 63 | 35 | 41 | 51 | 2092 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-09-27 | 0 | 0 | 2 | 2 | 19 | 3 | 5 | 3 | 7 | 2 | ... | 59 | 85 | 59 | 69 | 61 | 47 | 78 | 66 | 37 | 3770 |
| 2021-09-28 | 1 | 0 | 2 | 3 | 19 | 0 | 4 | 7 | 7 | 0 | ... | 45 | 79 | 63 | 61 | 60 | 12 | 61 | 60 | 22 | 3348 |
| 2021-09-29 | 7 | 0 | 8 | 3 | 25 | 1 | 6 | 3 | 5 | 2 | ... | 47 | 71 | 73 | 56 | 70 | 39 | 78 | 72 | 45 | 3937 |
| 2021-09-30 | 4 | 0 | 7 | 3 | 24 | 2 | 6 | 8 | 11 | 3 | ... | 59 | 62 | 53 | 68 | 72 | 17 | 69 | 102 | 47 | 3934 |
| All | 12 | 26 | 41 | 143 | 145 | 161 | 191 | 215 | 217 | 237 | ... | 8015 | 8088 | 8200 | 8470 | 8733 | 9056 | 9484 | 9574 | 10220 | 532354 |
184 rows × 169 columns
Since we hypothesize that the least-used stations are the ones eventually made inactive, we extract the information for the least-used station, along with the most-used station for comparison. From here, we can also tabulate the number of trips on each day for stations 3213 and 3038.
#Get the usage of stations whose status did not change
#Pick the most-used station and the least-used one (excluding the virtual station 3000)
status_normal_usage = freq_crosstab.iloc[:,[0,-2]]
#Get the frequency of recent 60 days before the station is abandoned
status_normal_usage_60d=status_normal_usage.loc[dt.date(2021,5,16):dt.date(2021,7,15)]
#60 days frequency of 3038
usage_3038_60d=pd.crosstab(columns=q2q3['start_station_id'],index=q2q3['start_date'])[3038].loc[dt.date(2021,5,16):dt.date(2021,7,15)]
#60 days frequency of 3213
usage_3213_60d=pd.crosstab(columns=q2q3['start_station_id'],index=q2q3['start_date'])[3213].loc[dt.date(2021,5,8):dt.date(2021,7,15)]
Now, to visualize, we use Python's plotly library. We create a line chart containing the frequencies for stations 3250, 3212, 3213 and 3038, plotting two months of data.
# Plotting frequency for stations 3250, 3212, 3213 and 3038 across two months
fig = go.Figure(data = [
go.Scatter(mode='lines', name='3250 (least used)', x=status_normal_usage_60d[::6].index, y=status_normal_usage_60d[::6].iloc[:,0]),
go.Scatter(mode='lines', name='3212 (most used)', x=status_normal_usage_60d[::6].index, y=status_normal_usage_60d[::6].iloc[:,1]),
go.Scatter(mode='lines', name='3213', x=usage_3213_60d[::6].index, y=usage_3213_60d[::6]),
go.Scatter(mode='lines', name='3038', x=usage_3038_60d[::6].index, y=usage_3038_60d[::6]),
])
fig.update_layout(xaxis=dict(showgrid=False),
yaxis = dict(showgrid=False),
title='Daily usage of stations',
title_font_size=24,
title_xanchor='auto',
title_yanchor='middle')
fig.update_yaxes(title='Frequency')
fig.update_xaxes(title='Date')
fig.show()
The graph lets us clearly compare the trip frequencies across stations. The stations that were made inactive are not the ones that were least used, so our analysis disproves the hypothesis.
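To make that visual comparison quantitative, one option (our own sketch, on invented totals rather than the real crosstab) is to check where the deactivated stations fall in the overall usage distribution via percentile ranks:

```python
import pandas as pd

# Toy total-trip counts per station (illustrative values only).
totals = pd.Series({3250: 12, 3192: 760, 3038: 980, 3213: 1450, 3212: 10220})
deactivated = [3038, 3213]

# Percentile rank of each station's usage (0 = least used, 1 = most used).
ranks = totals.rank(pct=True)
print(ranks.loc[deactivated])  # 3038 -> 0.6, 3213 -> 0.8

# If deactivations targeted the least-used stations, these ranks would sit
# near the bottom; mid-to-high ranks match what the line chart shows.
```

With the real data, `totals` would be the `'All'` row of `freq_crosstab` with the `'All'` column and the virtual station 3000 excluded.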
We conduct a total of five analyses on our datasets to identify and understand different aspects of improving the operational efficiency of Indego. From the analysis conducted above, our findings are as follows: